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Abstract. It is argued that the Calibrated Bayesian (CB) approach to 
statistical inference capitalizes on the strength of Bayesian and frequen- 
tist approaches to statistical inference. In the CB approach, inferences 
under a particular model are Bayesian, but frequentist methods are 
useful for model development and model checking. In this article the 
CB approach is outlined. Bayesian methods for missing data are then 
reviewed from a CB perspective. The basic theory of the Bayesian ap- 
proach, and the closely related technique of multiple imputation, is 
described. Then applications of the Bayesian approach to normal mod- 
els are described, both for monotone and nonmonotone missing data 
patterns. Sequential Regression Multivariate Imputation and Penalized 
Spline of Propensity Models are presented as two useful approaches for 
relaxing distributional assumptions. 

Key words and phrases: Maximum likelihood, multiple imputation, 
penalized splines, propensity models, sequential regression multivariate 
imputation. 



1. INTRODUCTION 

There was clearly a time, perhaps not too far in 
the recent past, when Bayesian methods were con- 
sidered "beyond the pail" by frequentist statisti- 
cians. But Bayesian methods have been resurgent 
in recent years, to the extent that few statisticians 
have no interest in them, even if they do not buy 
the complete philosophical package. 

In this article I summarize my perspective on the 
role of Bayesian methods in statistics, borrowing 
from a more extensive discussion in Little (2006). 
I then provide a brief overview of Bayesian infer- 
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ence for missing data problems, both modeling and 
ignoring the missing data mechanism, and multi- 
ple imputation (MI), an important practical tool 
for dealing with missing data that has a Bayesian 
etiology. Finally, I give some examples of Bayesian 
missing-data methods which I believe frequentists 
could profitably add to their analytical toolkit. 

Bayesian methods are particularly useful for han- 
dling missing data problems in statistics. Incomplete 
data problems are readily amenable to likelihood- 
based methods, since they do not require a rect- 
angular data matrix. Maximum likelihood (ML) is 
an important approach, but the loglikelihoods cor- 
responding to missing data problems are typically 
more complex than likelihoods for complete data, 
deviate more from the quadratic approximations that 
underlie asymptotic inferences, and are subject to 
under-identification or weak identification of param- 
eters. Consequently, ML requires iterative calcula- 
tions, information matrix-based standard errors are 
often difficult to compute in high-dimensional prob- 
lems, and asymptotic ML inferences can have seri- 
ous deficiencies, particularly with small fragmentary 
samples. In contrast, draws from the Bayesian pos- 
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terior distribution can often be computed using di- 
rect simulation or Markov Chain Monte Carlo tech- 
niques, and these provide estimates of standard er- 
rors without the need to compute and invert the in- 
formation matrix. The inferences based on the pos- 
terior distribution often have better frequentist prop- 
erties than asymptotic inferences based on ML. Fur- 
thermore, multiple imputations can be generated as 
a byproduct of Bayesian calculations, and provide 
a practically useful and flexible tool for solving miss- 
ing data problems. These points are further devel- 
oped in the material that follows. 

2. WHY BAYES? THE CALIBRATED BAYES 
PHILOSOPHY 

The statistics world is still largely divided into 
frequentists, who base inference for an unknown pa- 
rameter 6 on hypothesis tests or confidence inter- 
vals derived from the distribution of statistics in 
repeated sampling, and Bayesians, who formulate 
a model for the data and prior distribution for un- 
known parameters, and base inferences for unknowns 
on posterior distributions. Bayesians are also "sub- 
jective," as when proper priors are elicited, and "ob- 
jective," as when conventional "reference priors" are 
adopted. Both these facets of the Bayesian paradigm 
have useful roles, depending on context. Asymptotic 
maximum likelihood inference can be seen as a form 
of large sample Bayes, with the interval for 9 being 
interpreted posterior credibility interval rather 
than a confidence interval. 

I believe both systems of statistical inference have 
strengths and weaknesses and, hence, the best course 
is to seek a compromise that combines them in a way 
that capitalizes on their strengths. The Bayesian 
paradigm is best suited for making statistical infer- 
ence under an assumed model. Indeed, under full 
probability modeling, with prior distributions as- 
signed to parameters, the Bayes theorem is indeed 
a theorem — the path determined by probability the- 
ory to inference about unknowns given the data, 
whether the targets are parameters or predictions 
of unobserved quantities. 

The frequentist approach, on the other hand, has 
various well-known limitations regarding inference 
under an assumed model. First, it is not prescriptive: 
frequentist theory seems more like a set of concepts 
for assessing properties of inference procedures, ra- 
ther than an inferential system per se. Under an 
agreed model, and assuming large samples, there 
is a relatively prescriptive path to inference based 



on maximum likelihood (ML) estimates and their 
large-sample distribution. However, other frequen- 
tist approaches are entertained in practice, like gen- 
eralized estimating equations, based on robustness 
or other considerations. Also, there is no prescrip- 
tive frequentist approach to small sample problems. 
Indeed, for many problems, such as the Behrens- 
Fisher problem of comparing two means of normal 
distributions with different unknown variances, no 
procedure exists that has exact repeated-sampling 
properties, such as exact nominal confidence cover- 
age for all values of the unknown parameter. Baye- 
sian methods provide exact frequentist coverage for 
some complete-data problems — this occurs, in par- 
ticular, in problems where the Bayesian and frequen- 
tist inferences are the same, as in t inference for 
normal multiple regression with a uniform prior on 
the regression coefficients and log variance. For more 
complex problems, including problems with miss- 
ing data, Bayesian methods do not generally pro- 
vide exact frequentist coverage, but they often im- 
prove on ML by providing better small-sample in- 
ferences, perhaps because Bayesian model shrinkage 
moderates inferences based on extreme parameters 
estimates. As just one example, consider the adjust- 
ment of estimates for categorical data motivated by 
Bayesian ideas (Agresti, 2002). 

The Bayesian approach is prescriptive in the sense 
that, once a model and prior distribution are spec- 
ified, there is a clear path to inferences based on 
the posterior distribution, or optimal estimates for 
a given choice of loss function. There is no prescrip- 
tion for choosing the model and prior distribution — 
that is what makes applied statistics interesting — 
but certain "reference" prior distributions for com- 
plete-data problems can be expected to yield good 
frequentist properties when applied to missing data 
problems; see, for example, Little (1988). 

Frequentist inference violates the likelihood prin- 
ciple, and is ambiguous about whether to condi- 
tion on ancillary or approximately ancillary statis- 
tics when performing repeated sampling calculations. 
Little (2006) provides more discussion, with exam- 
ples. 

An attractive feature of Bayesian methods in com- 
plete-data or missing-data problems is the treat- 
ment of nuisance parameters. Bayesian inference in- 
tegrates over these parameters, rather than fixing 
them at their ML estimates. This tends to yield in- 
ferences with improved frequentist properties, since 
the uncertainty about these parameters is taken into 
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account. For example, for complete or incomplete 
data problems, restricted ML, which integrates over 
location parameters, is generally viewed as superior 
to ML, which maximizes them. 

If we were handed the model on a plate and told to 
do inference for unknowns, then Bayesian statistics 
is the clear winner. The problem, of course, is that 
we never know the true model. Bayesian inference 
requires and relies on a high degree of model speci- 
fication (Efron, 1986) — full specification of a likeli- 
hood and prior. Developing a good model is chal- 
lenging, particularly in complex problems. Further- 
more, all models are wrong, and bad models lead 
to bad answers: under the frequentist paradigm, the 
search for procedures with good frequentist proper- 
ties provides some degree of protection against model 
misspecification, but there seems no such built-in 
protection under a strict Bayesian paradigm where 
frequentist calculations are not entertained. 

Good principles for picking models are essential, 
and here I feel frequentist methods have an impor- 
tant role. We want models that yield inferences with 
good frequentist properties, such as 95% credibility 
intervals that cover the unknown parameter 95% of 
the time if the procedure was applied to repeated 
samples. The Bayesian has some tools for model 
development and checking, like Bayes factors and 
model averaging, but Bayesian hypothesis testing 
has well known problems, and, in my view, frequen- 
tist approaches are essential when it comes to model 
development and assessment. 

To summarize, Bayesian statistics is strong for in- 
ference under an assumed model, but relatively weak 
for the development and assessment of models. Fre- 
quentist statistics provides useful tools for model de- 
velopment and assessment, but has weaknesses for 
inference under an assumed model. If this summary 
is accepted, then the natural compromise is to use 
frequentist methods for model development and as- 
sessment, and Bayesian methods for inference under 
a model. This capitalizes on the strengths of both 
paradigms, and is the essence of the approach known 
as Calibrated Bayes (CB). 

Many statisticians have advanced CB ideas (e.g., 
Peers, 1965; Welch, 1965; Dawid, 1982), but I was 
particularly influenced by seminal papers by Box 
(1980) and Rubin (1984). Box (1980) wrote, 

"I believe that. . . sampling theory is needed 
for exploration and ultimate criticism of 
the entertained model in the light of the 



current data, while Bayes' theory is needed 
for estimation of parameters conditional 
on adequacy of the model." 

He based his implementation of this idea on the fac- 
torization: 

p{Y,e\Model) = p(Y\Model)p(6\Y, Model), 

where the second term on the right side is the pos- 
terior distribution of the parameter given data Y 
and Model, and is the basis for inference, and the 
first term on the right side is the marginal distribu- 
tion of the data Y under the Model, and is used 
to assess the validity of the Model, with the aid 
of frequentist considerations. Specifically, discrep- 
ancy functions of the observed data d(Y) are as- 
sessed from the perspective of realizations from their 
marginal distribution p(d(Y)|Model). A question- 
able feature of this "prior predictive checking" is 
that checks are sensitive to the choice of prior dis- 
tribution even when this choice has limited impact 
on the posterior inference; in particular, it leads to 
problems with assessment of models involving non- 
informative priors. 
Rubin (1984) wrote, 

"The applied statistician should be Baye- 
sian in principle and calibrated to the real 
world in practice — appropriate frequency 
calculations help to define such a tie. . . 
frequency calculations are useful for mak- 
ing Bayesian statements scientific, scien- 
tific in the sense of capable of being shown 
wrong by empirical test; here the tech- 
nique is the calibration of Bayesian proba- 
bilities to the frequencies of actual events." 

Rubin (1984) advocated model checking based on 
a different distribution, namely, p(Y* ,6*\Y, Model), 
the predictive distribution of future data Y* and 
parameter values 6* given the Model and observed 
data Y. This leads to posterior predictive checks 
(Rubin, 1984; Gelman, Meng and Stern, 1996), which 
extend frequentist checking methods by not limiting 
attention to checking statistics that have a known 
distribution under the model. These checks involve 
an amalgam of Bayesian and frequentist ideas, but 
are clearly frequentist in spirit in that they concern 
embedding the observed data within a sequence of 
unobserved data sets that could have been generated 
under the Model, and seeing whether the observed 
data are "reasonable." 

Philosophy aside, perhaps the main reason why 
Bayesian methods have flourished in recent years is 
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the development of powerful computational tools, li- 
ke the Gibbs' sampler and other Markov Chain Mon- 
te Carlo (MCMC) methods. These, together with 
gains in computing power, have made it computa- 
tionally feasible to carry out the high-dimensional 
integrations required. An important early breakt- 
hrough in MCMC methods actually occurred for a 
missing data problem, as I discuss in Example 2 
below. Even if frequentists are completely against 
Bayes, they can apply these Bayesian computational 
tools with weak prior distributions, and interpret re- 
sults as approximations to ML, with similar asymp- 
totic properties. 

3. A SHORT HISTORY OF STATISTICAL 
ANALYSIS WITH MISSING DATA 

I divide the development of missing data methods 
in statistics into four eras: 

3.1 Pre-EM Algorithm (Pre-1970s) 

Early missing data methods involved complete- 
case analysis, that is, simply discarding data with 
any values missing, or simple imputation methods, 
which filled in missing values with best estimates 
and analyzed the filled-in data. The latter approach 
was developed quite extensively in the case of anal- 
ysis of variance with missing outcomes, which were 
imputed to maintain a balanced design and hence 
an easily inverted design matrix (see Little and Ru- 
bin, 2002, Chapter 2). These ingenious methods are 
now mainly of historical interest, since inverting the 
design matrix corresponding to the unbalanced data 
is not a big problem given advances in modern com- 
puting. ML methods were developed for some sim- 
ple missing data problems, notably bivariate normal 
data with missing values on one variable, which An- 
derson (1957) solved noniteratively by factoring the 
likelihood (see Example 1 below). ML for complex 
problems was iterative and generally too hard given 
limits of computation, although progress was made 
for contingency tables (Hartley, 1958) and normal 
models (Hartley and Hocking, 1971). 

3.2 The Maximum Likelihood Era 
(1970s-Mid 1980s) 

ML methods became popular and feasible in the 
mid-1970s with the development of the Expectation- 
Maximization (EM) algorithm. EM builds a link 
with complete-data ML and is simple to program 
in several important multivariate models, including 
the multivariate normal model with a general pat- 



tern of missing values. The term EM was coined 
in the famous paper by Dempster, Laird and Ru- 
bin (1977), which established some key properties 
of the method, including the fact that the likelihood 
does not decrease at each iteration. The EM algo- 
rithm had been previously discovered several times 
for particular models (e.g., McKendrick, 1926; Hart- 
ley, 1958; Baum et al., 1970), and had been formu- 
lated in some generality by Orchard and Woodbury 
(1972) and Sundberg (1974). The simplicity and ver- 
satility of EM motivated extensions of EM to handle 
more difficult problems, and applications to a vari- 
ety of complete-data models for categorical and con- 
tinuous data, as reviewed in Little and Rubin (1987), 
McLachlan and Krishnan (1997) and Meng and van 
Dyk (1997). For generalizations of the EM idea, see 
Lange (2004). 

Another important development in this era was 
the formulation of models for the missing data mech- 
anism, and associated sufficient conditions for when 
the missing data mechanism can be ignored for fre- 
quentist and Bayesian inference (Rubin, 1976). 

3.3 Bayes and Multiple Imputation 
(Mid 1980s-Present) 

The transition from ML to Bayesian methods in 
the missing data setting was initiated by Tanner 
and Wong (1987), who described data augmenta- 
tion to generate the posterior distribution of the 
parameters of the multivariate normal model with 
missing data. Data augmentation is closely related 
to the Gibbs' sampler, as discussed below. Another 
important development was Rubin's (1978, 1987, 
1996) proposal to handle missing data in public use 
data sets by multiple imputation (MI), motivated by 
Bayesian ideas. In its infancy, this proposal seemed 
very exotic and computationally impractical — not 
any more! MCMC facilitates Bayesian multiple im- 
putation, and is now implemented in publicly-availa- 
ble software for the convenience of users of both 
Bayesian and frequentist persuasions. 

3.4 Robustness Concerns 
(1990s-Present) 

Model-based missing data methods are potentially 
vulnerable to model misspecification, although they 
tend to outperform naive methods even when the 
model is misspecified (e.g., Little, 1979). Modern in- 
terest in limiting effects of model misspecification by 
adopting robust procedures has extended to miss- 
ing data problems, notably with "doubly robust" 
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procedures based on augmented inverse-probability 
weighted estimating equations (Robins, Rotnitsky 
and Zhao, 1994). From a more directly model-based 
Bayesian perspective, robustness takes the form of 
developing models that make relatively weak struc- 
tural assumptions. A method based on one such 
model, Penalized Spline of Propensity Prediction 
(PSPP, Little and An, 2004), is discussed in Exam- 
ple 4 below. Another aspect of robustness concerns 
has been more interest in model checks for standard 
missing data models (Gelman et al., 2005). 

I now sketch the likelihood and Bayesian theory 
for the analysis of data with missing values that un- 
derlies the methods in Sections 3.2 and 3.3. I then 
describe the transition from Sections 3.2 to 3.4 for 
the case of multivariate normal models, and elabo- 
rations for non-normal data. 

4. LIKELIHOOD-BASED METHODS WITH 
MISSING DATA 

Likelihoods can be defined for nonrectangular da- 
ta, so likelihood methods apply directly to missing- 
data problems: 

statistical model + incomplete data =>■ likelihood. 

Given the likelihood function, standard approaches 
are to maximize it, leading to ML, with associated 
large sample standard errors based on the informa- 
tion, the sandwich estimator or the bootstrap; or to 
add a prior distribution and compute the posterior 
distribution of the parameters. Draws from the pre- 
dictive distribution of the missing values can also be 
created as a basis for MI. 

As described in Little and Rubin (2002, Chap- 
ter 6), let Y = (yij)nxK represent a data matrix 
with n rows (cases) and K columns (variables), and 
define the missing-data indicator matrix M = 
{fnij)nxK-, with entries rriij = if yij is observed, 
rriij = 1 if yij is missing. Also, write Y = (Y ohs , Y mis ), 
where l^,bs represents the observed components of Y 
and Y m [ s the missing components. A full parametric 
model factors the distribution of (Y,M) into a dis- 
tribution f(Y\9) for Y indexed by unknown param- 
eters 9, and a distribution f(M\Y,ip) for M given Y 
indexed by unknown parameter ip. (This is called 
a selection model factorization; the alternative fac- 
torization into the marginal distribution of M and 
the conditional distribution of Y given M is called 
a pattern-mixture model.) If Y was fully observed, 



the posterior distribution of the parameters would 
be 

Pcorap\ete{0^\Y,M) = const, x tt(0,V) x L(9,l/)\Y), 

where Tr(9,ip) is the prior distribution of the param- 
eters, 

L(p,Tl>\Y) = f(Y\0)xf(M\Y,1>) 

is the complete-data likelihood and const, is a nor- 
malizing constant. With incomplete data, the full 
posterior distribution becomes 

PMi(0,ijj\Y ohs ,M) 

(1) 

octt(0,VO xL(6,i;\Y obs ,M), 

where L(9, V>|^obs> M) is the observed likelihood, ob- 
tained by integrating the missing values out of the 
complete-data likelihood: 

/(y obs ,M|0,v) 

= J f(Y obs ,Y mis \6)f(M\Y ohs ,Y mis ,i;)dY mis . 

A simpler posterior distribution of 9 ignores the 
missing data mechanism, and is based on the likeli- 
hood given the observed data l^bs : 

(2) Piga (e\Y oh8 )<xir(e)xL(9\Y ohs ), 

L(9\Y ohs ) = J f(Y ohs ,Y mis \9)dY mis , 

which does not involve the model for the distribution 
of M. 

Statistical analysis based on (2) is considerably 
easier than analysis based on (1), since (a) the model 
for the missing-data mechanism is hard to specify, 
and has a strong influence on inferences; (b) the inte- 
gration over the missing data is often easier for equa- 
tion (2) than for equation (1); and (c) the full model 
is often under-identified or poorly identified; identi- 
fication is in some ways less of an issue in Bayesian 
inference, but results rest strongly on the choice of 
prior distribution. Thus, ignoring the missing-data 
mechanism is useful if it is justified. Sufficient condi- 
tions for ignoring the missing-data mechanism and 
basing inference on (2) (Rubin, 1976; Little and Ru- 
bin, 2002, Chapter 6) are as follows: 

Missing at Random (MAR): p(M\Y ohs ,Y mis ,?p) = 

p(M\Y ohs ,i;) for all y mis , 

A-priori independence: Tr(9,ip) = ir(9) x 

Of these, MAR is the key condition in practice, and 
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it implies that missingness can depend on values in 
the data set that are observed, but not on values 
that are missing. 

The main challenges in developing posterior dis- 
tributions based on (1) or (2) are the choice of model 
and computational issues, since the likelihood based 
on the data with missing values is typically much 
more complex than the complete-data likelihood. In 
the ML world, the expectation-maximization (EM) 
algorithm creates a tie between the complicated ob- 
served data likelihood and the simpler complete- 
data likelihood, facilitating this computational task. 
Specifically, suppose the missing-data mechanism is 
ignorable, and let 6® be the current estimate of the 
parameter 9. The E-step of EM finds the expected 
complete-data loglikelihood if equaled 0^: 

Q(0\Y oh8 ,0®) 

(3) = J log/(y obs ,Y mis ;#) 

■f(Y mis \Y ohs ,9 = 0^)dY mis . 

The M-step of EM determines 6> (t+1) by maximizing 
this expected complete-data loglikelihood: 

(4) Q(# (m) |y obs ,0 (t) ) > Q(9\Y ohs ,0^) for all 0. 

In the Bayesian world, the analog of EM is data 
augmentation, a variant of the Gibbs' sampler. (An 
even closer analog to the Gibbs' sampler is the Ex- 
pectation Conditional Maximization algorithm, a va- 
riant of EM.) The key idea is to iterate between 
draws of the missing values and draws of the pa- 
rameters; draws of missing values replace expected 
values of functions of the missing values in the E- 
step of EM, and draws of the parameters replace 
maximization over the parameters in the M-step of 
EM. Specifically, suppose the missing data mecha- 
nism is ignorable, and let (Y^t , #( dt )) be current 
draws of the missing data and parameters at itera- 
tion t; here and elsewhere a superscript d is used to 
denote a draw from a distribution. Then at iteration 
t + 1, the analog of the E-step (3) of EM is to draw 
new values of the missing data: 

(5) Y^ t+1) ~ p (Y mis \Y ohs ,0^), 

and the analog of the M-step (4) is to draw a new 
set of parameters from the completed-data posterior 
distribution: 

(6) 0( d ' t+1) ~p(0\Y ohs ,Y^ +1) ). 



As t tends to infinity, this sequence converges to 
a draw (Y^l,0^) from the joint posterior distri- 
bution of YJnis and 0. Those familiar with MCMC 
methods will recognize this as an application of the 
Gibbs' sampler to the pair of variables Y m i s and 0. 
The utility lies in the fact that (5) is often facil- 
itated since the distribution conditions on the pa- 
rameters, and (6) is a complete-data problem since 
it conditions on the imputations derived from (5). 
For more discussion of Bayesian computations for 
missing data, see Tan and Tian (2010). 

5. MULTIPLE IMPUTATION 

Draws of the missing data from equation (5) 
at convergence can be used to create D > 1 multiply- 
imputed data sets. Bayesian MI combining rules can 
then be used for inferences that propagate imputa- 
tion uncertainty. 

I outline the theory when the missing data mech- 
anism is ignorable, although it readily extends to 
the case of nonignorable nonresponse. The idea is to 
relate the observed-data posterior distribution (1) 
to the "complete-data" posterior distribution that 
would have been obtained if we had observed the 
missing data Y m i S , namely, 

(7) p(0\Y obs ,Y mis ) oc tt(0) x L(0|r obs , y mis ). 

Equations (2) and (7) can be related by standard 
probability theory as 

Pign(0|Yobs) 

(8) 

= J p(9\Y ohs ,Y mis )p(Y mis \Y ohs )dY mis . 

Equation (8) implies that the posterior distribution 
Pi gn (#|Yobs) can be simulated by first drawing the 
missing values, Y^9, from their posterior distribu- 
tion, p(Ymis|Yobs)) imputing the drawn values to com- 
plete the data set, and then drawing from its 
"completed-data" posterior distribution, p(0\Y o \, s , 

Y«>). That is, 

(9) PigaO^obe) « 7jE^l y °bs, 

d=l 

When the posterior mean and variance are adequate 
summaries of the posterior distribution, (9) can be 
effectively replaced by 

(10) E(9\Y ohs ) = E[E(9\Y ohs ,Y mis )\Y ohs ], 
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and 

Var(0|Y obs ) = E[V a v(9\Y ohs ,Y mis )\Y ohs ] 

(11) 

+ Var[E(6\Y oha ,Y mis )\Y ohs }. 

Approximating (10) and (11) using draws of Y m \ s 
yields 

1 D 

(12) E(e\Y ohs )*e = -Y,0 id \ 

d=l 

where §W = E(6\Y ohs , Y%9) is the poster lor mean 
of 9 from the <ith completed data set, and 

(13) Var(fl|Y obs ) w7 + (l + 1/D)B, 

say, where V = D^ 1 ££ =1 V&i(9\Y ohs ,Y^)) is the av- 
erage of the complete-data posterior covariance ma- 
trix of 9 calculated for the dth data set (Yobs , Y^ ) , 

B = J2d=i0d - O)0d - O f/{D - 1) is the between- 
imputation covariance matrix, and (1 + 1 /D) is a cor- 
rection to improve the approximation for small D. 
The quantity (1 + 1/ D)B in (13) estimates the con- 
tribution to the variance from imputation uncer- 
tainty, missed (i.e., set to zero) by single imputation 
methods. 

Equations (12) and (13) are basic MI combining 
rules. Refinements that replace the normal reference 
distribution for scalar 9 by a Student t distribution 
are given in Rubin and Schenker (1986), with fur- 
ther small-sample t refinements in Barnard and Ru- 
bin (1999); extensions to hypothesis testing are de- 
scribed in Rubin (1987) or Little and Rubin (2002, 
Chapter 10). Besides incorporating imputation un- 
certainty, another benefit of multiple imputation is 
that the averaging over data sets in (12) results in 
more efficient point estimates than does single ran- 
dom imputation. Often MI is not much more diffi- 
cult than doing a single imputation — the additional 
computing from repeating an analysis D times is 
not a major burden and methods for combining in- 
ferences are straightforward. Most of the work is 
in generating good predictive distributions for the 
missing values. 

From a frequentist perspective, Bayesian MI for 
a parametric model has similar large-sample prop- 
erties to ML, and it can be simpler computationally. 
Another attractive feature of MI is that the imputa- 
tion model can differ from the analysis model by in- 
cluding variables not included in final analysis. Some 
examples follow: 

(a) MI was originally proposed for public use files, 
where the imputer often has variables available for 
imputation, like geography, that are not available 



to the analyst because of confidentiality constraints. 
Such variables can be included in the imputation 
model, but will not be available for analysis. In other 
settings, auxiliary variables that are not suitable for 
inclusion in the final model, such as side-effect data 
for drugs in a clinical trial, may be useful predictors 
in an imputation model. 

(b) For public use files, users perform analyses 
with different subsets of variables. Different ML anal- 
yses involving a variable with missing values imply 
different imputation models, to the extent that they 
involve different sets of variables. A more coherent 
approach is to multiply impute missing variables us- 
ing a common model, and then apply MI methods to 
each of the analyses involving subsets of variables. 
This allows variables not in the subset to help pre- 
dict the missing values. 

(c) MI combining rules can also be applied when 
the complete-data inference is not Bayesian (for ex- 
ample, nonparametric tests or design-based survey 
inference). The assumptions contained in the impu- 
tation model are then confined to the imputations, 
and with small amounts of missing data, simple im- 
putation models may suffice. 

6. APPLICATIONS OF BAYESIAN METHODS 
TO MISSING DATA PROBLEMS 

Sections 4 and 5 sketched the basic theory of Baye- 
sian inference for missing data and the related me- 
thod of MI. We now provide some examples of im- 
portant models where these methods can be put to 
practical use. We focus mainly on continuous vari- 
ables, although methods for categorical variables, 
and mixtures of continuous and categorical variables, 
are also available (Little and Rubin, 2002). 

Example 1 (Data with a monotone pattern of 
missing values). I mentioned in Section 3.1 the fac- 
tored likelihood method of Anderson (1957). Con- 
sider bivariate normal data on two variables (Y\, Y2) 
where Y\ is observed for all n observations, and Y2 
is observed for r < n observations, that is, has n — r 
missing values. Assume the missing-data mechanism 
is ignorable. The factored likelihood is obtained by 
transforming the joint normal distribution of (Yi, Y2) 
into the marginal distribution of Y±, normal with 
mean fii and variance an, and the conditional dis- 
tribution of Y2 given Y\, normal with mean fao-i + 
$2 1-1^1 an d variance 022-1 • The likelihood then fac- 
torizes into the normal likelihood for <p\ = (/ii,0n) 
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based on the n cases with Y\ observed, and the nor- 
mal likelihood for fa = (fao-i, 021-1, o"22-i) based on 
the r cases with both Y\ and Y2 observed. The ML 
estimates are immediate: the sample mean and sam- 
ple variance of Y\ (with denominator n) based on 
all n observations for fa, and the least squares es- 
timates of the regression of Y2 on Y\ (with no de- 
grees of freedom correction for 0221) based on the r 
complete cases for fa. ML estimates of other pa- 
rameters, such as the mean \i2 of Y2, are obtained 
by expressing them as functions of (fa, fa) and then 
substituting ML estimates of those parameters. In 
particular, this leads to the well-known regression 
estimate of [12: 

(14) fj-2 = fao-i + fal-l^l, 

which is easily seen to be obtained when missing 
values of Y2 are imputed as predictions from the 
regression of Y2 on Y\, with regression coefficients 
estimated on the complete cases. 

A corresponding Bayesian analysis is obtained by 
adding conjugate prior distributions for fa and fa, 
and computing draws from the posterior distribu- 
tions of these parameters. The posterior distribu- 
tion of fa is based on standard Bayesian methods 
applied to the sample of n complete observations 
on Y\ — inverse-chi-squared for an, normal for ^\ 
given a\\, and Student's t for [ii. The posterior dis- 
tribution of fa is based on standard Bayesian meth- 
ods for the regression of Y2 on Yi applied to the r 
complete observations on Yi and Yi — inverse chi- 
squared for a 2 2-i, normal for (/?20-i, /320 i) given cr 2 2-i, 
and multivariate t for (/?20-i > fao-i ) • Draws (/U^ , a[^ , 

^20-1 'A211 > °22-i ) f rom these posterior distributions 
are simple to compute (see Little and Rubin, 2002, 
Chapter 7, for details). Draws from the posterior 
distribution of other parameters are then created in 
the same way as ML estimates, by expressing the 
parameters as functions of (fa, fa) and then substi- 
tuting draws. For example, a draw from the poste- 
rior distribution of ^2 is 

(15) $=f$ 1 +f&A*, 

a formula that mirrors the ML formula (14). 

This Bayesian approach is asymptotically equiva- 
lent to ML, but it has several useful features. First, 
prior knowledge about the parameters can be incor- 
porated in the prior distribution if this is available; 
if not, noninformative reference prior distributions 
can be applied. Second, the posterior distributions 
do a better job of capturing uncertainty in small 
samples; for example, the draws (15) incorporate t- 



like corrections, which are not provided by standard 
asymptotic ML calculations. Third, the draws yield 
immediate estimates of uncertainty, such as the pos- 
terior variance, and 95% credibility intervals. The 
factored likelihood approach does not yield conve- 
niently simple formulas for large sample variances 
based on the information matrix. These are easily 
approximated by draws (15), and are actually supe- 
rior (in a frequentist sense) to asymptotic variances 
since they reflect the uncertainty better. 

Computational advantages in simulating draws 
from the posterior distribution are modest in the 
current bivariate normal example, since there are 
not many parameters. These benefits are more sub- 
stantial in larger problems where the factored like- 
lihood trick can be applied. Suppose that there are 
K > 2 variables (Yi , Y2, ■ ■ . , Yk) such that (a) the da- 
ta have a monotone pattern, such that is always 
observed when Yfc+i is observed, for k = 1, . . . ,K — 1; 
and (b) the conditional distribution of (Y^\Yi, . . . , 
Yfc_i) has a distribution (not necessarily normal) 
with unknown parameters fa, for k = 1, . . . , K; and 
(c) the parameters (fa, . . . , 4>k) are distinct and have 
independent prior distributions. Draws from the pos- 
terior distribution of <f> = (fa, . . . , <$>k) can then be 
obtaining from a sequence of complete-data pos- 
terior distributions, with the posterior distribution 
of fa based on the subset of data with (Yi, . . . , Yk) 
observed (Little and Rubin, 2002, Chapter 7). This 
elegant scheme forms the basis for MI in the case 
of a monotone pattern. In particular, SAS PROC 
MI yields multiple imputations for normal models, 
where the regressions of on Yi, . . . , are not 
required to be linear and additive, as would be the 
case if the joint distribution was multivariate nor- 
mal. 

When the data are monotone but the parame- 
ters of the sequence of conditional distributions are 
not a-priori independent, or when the pattern is not 
monotone, these simple factored likelihood methods 
no longer apply, and draws from the posterior dis- 
tribution need an iterative scheme. Markov Chain 
Monte Carlo methods then come to the rescue, as 
in the next example. 

Example 2 (The multivariate normal model with 
a general pattern of missing values). Suppose ob- 
servations yi are assumed to be randomly sampled 
from a multivariate normal distribution, that is, 

Ui= (yn,---,yip) ~md N p (fj,,T,), 
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the normal distribution with mean fi and covariance 
matrix S. There are missing values, and let y hs,i, 
Umis,i denote respectively the set of observed and 
missing values for observation i. Given current draw 
e (dt) = ( At (*) > s(*)) of the parameters, missing val- 
ues (5) are drawn as 

(16) V^^PiVrisAv^oW), i = l,-..,n, 

which is the multivariate normal distribution of the 
missing variables given the observed variables in 
observation i, with parameters that are functions 
of 9^ dt \ readily computed using the sweep operator 
(Little and Rubin, 2002, Section 7.4). New param- 
eters (6) are drawn from the posterior distribution 
given the filled-in data, which is a standard Bayesian 
problem, namely, 

(17) ^ d ^~ p{ X\ Yohs ,Y^ +1 \ 
( /x (W)| S (d,t+i) ) _ p(At | S (<i > t+i) ) 

(18) 

v Y (d,t+i), 

1 obs , l m i s J , 

where (17) is a draw from an inverse Wishart dis- 
tribution, and (18) is a draw from a multivariate 
normal distribution. Details of these steps were orig- 
inally described in Tanner and Wong (1987) as part 
of their data augmentation algorithm, and they form 
the basis for the multiple imputation algorithm in 
SAS PROC MI, originally developed by Schafer 
(Schafer, 1997). Steps (16)-(18) are closely related 
to the EM algorithm for ML estimation, except that 
they lead to draws from the posterior distribution. 
When feasible as here, it is recommended to first 
program EM, and correct programming errors by 
checking that the likelihood increases with each it- 
eration, and then convert the EM algorithm into the 
Gibbs algorithm, essentially by replacing the condi- 
tional means of the missing values in the E-step by 
draws (16), and the complete-data ML parameters 
in the M-step by draws (17) and (18). MI based on 
this model is available in a variety of software, in- 
cluding SAS PROC MI. 

A frequentist statistician might compute ML esti- 
mates and associated standard errors based on the 
information matrix. However, the Gibbs algorithm 
outlined above is simpler than computing informa- 
tion-matrix based standard errors, which are not an 
immediate output from EM. So a frequentist can 
use the draws from Gibbs' algorithm to compute 
tests and confidence intervals for the parameters, ex- 
ploiting the asymptotic equivalence of Bayes and fre- 
quentist inferences (Little and Rubin, 2002, Chap- 



ter 6). As in the previous example, the Bayesian 
approach improves some aspects of small sample in- 
ference by including i-like corrections reflecting un- 
certainty in the variance parameters. 

Example 2 allows missing data to be multiply im- 
puted for a general pattern of missing values, rather 
than the monotone pattern in Example 1. A limi- 
tation is that it assumes a multivariate normal dis- 
tribution for the set of variables with missing values 
(normality can be relaxed for variables that are com- 
pletely observed). This is a relatively strong para- 
metric assumption — in particular, it assumes that 
the regressions of missing variables on observed vari- 
ables are normal, linear and additive, which is not 
very appealing when a missing variable is binary or 
regressions involve interactions, for example. 

One approach to this problem is to modify the 
model to allow for mixtures of continuous and cate- 
gorical variables. The general location model of Olkin 
and Tate (1961) provides a useful starting point 
(Little and Rubin, 2002, Chapter 14). This is useful, 
but the need to formulate a tractable joint distribu- 
tion for the variables is restrictive. A more flexible 
approach is to apply MI for a sequence of condi- 
tional regression models for each missing variable, 
given all the other variables. This sequential regres- 
sion multivariate imputation (SRMI) method is only 
approximately Bayes, but what it loses in theoreti- 
cal coherence it gains in practical flexibility It is the 
topic of the next example. 

Example 3 (Sequential regression multivariate 
imputation). Suppose we have a general pattern 
of missing values, and (Y±,Y2, ■ ■ ■ , Yk) are the set of 
variables with missing values, and X represents a set 
of fully observed variables. SRMI (Raghunathan et 
al., 2001; Van Buuren et al., 2006) specifies models 
for the distribution of each variable Yj given all the 
other variables Y\,.. ., Yj-\,Yj + \, . . . , Yk and X, in- 
dexed by parameters ipj , with density gj (Yj \X, Y\ , . . . , 
Yj-i, Yj + i, . . . , Yk, ipj), and a noninformative prior 
distribution for ipj . Missing values of Yj at iteration 
t + 1 are imputed according to the following scheme: 
let be the observed or imputed value of Yj at it- 
eration t, and let Y"W denote the filled-in data set 
with imputations of Y\_, . . . ,Yj—i from iteration t + 1 
and imputations of Yj, . . . , Yk from iteration t. For 
j = 1, . . . , K, create new imputations of the missing 
values of Yj as follows: 

ipf ~p(^|A, yfr'*)), the posterior distribution 
of given data (X,Y^); 
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v (t+1) ~ a-(Y\X v [t+sl) v {t+1) v {t) v [t) 
) ' ^ is missin g! i = 1, • • • , n. 

This algorithm is repeated for t = 1,2,3,... until 
the imputations are stable; typically, more than one 
chain is run to facilitate assessment of convergence 
(Gelman and Rubin, 1992). The algorithm is then 
repeated D times to create D multiply-imputed data 
sets, and inferences are based on standard MI com- 
bining rules. 

The positive feature of SRMI is that it reduces 
the multivariate missing data problems into a set 
of univariate problems for each variable given all 
the other variables, allowing flexibility in the choice 
of model for each incomplete variable; that is, non- 
linear and interaction terms are allowed in the re- 
gressions, and the error distribution can be cho- 
sen to match the nature of the outcome — logistic 
for a binary variable, and so on. The drawback is 
that the regression models for each variable given 
the others does not generally correspond to a coher- 
ent model for the joint distribution of (Yi, . . . , Yk) 
given X. Thus, the Mi's are not draws from a well- 
defined posterior distribution. This does not seem 
to be a major problem in practice, and SRMI is 
a flexible and practical tool for handling a vari- 
ety of missing data problems. Software is available 
(Raghunathan, Solenberger and Van Hoewyk, 2009; 
MICE, 2009). 

The regression models in SRMI are parametric, 
and potentially vulnerable to model misspecifica- 
tion. As noted in Section 3.4, one recent interest in 
missing data research has been the development of 
robust methods that do not involve strong paramet- 
ric assumptions. My last example concerns so-called 
"doubly-robust methods" for missing data. 

Example 4 (Robust modeling: Penalized spline 
of propensity prediction). For simplicity, we con- 
sider the case where missingness is confined to a sin- 
gle variable Y. Let (xn, . . . , Xi p , yi) be a vector of 
variables for observation i, with j/j observed for i = 
1, . . . , r and missing for i = r + 1, . . . , n, and (xn, . . . , 
Xi p ) observed for i = 1, . . . ,n. We assume that the 
probability that yi is missing depends on (xn, ■ ■ ■ , Xi p ) 
but not yi, so the missing data mechanism is MAR. 
We consider estimation and inference for the mean 
of Y, fly = E(Y). Let mi denote the missing-data in- 
dicator for yi, with im = 1 when yi is missing and 
nii = when yi is observed. 

A number of robust methods involve the propen- 
sity to be observed, estimated by a logistic or pro- 
bit regression ofMonli,..., X p (Rosenbaum and 



Rubin, 1983; Little, 1986). In particular, propensity 
weighting computes the mean of the complete cases, 
weighted by the inverse of the estimated probability 
that Y is observed. Propensity weighting can yield 
estimates with large variances, and more efficient es- 
timates are obtained by predicting the missing val- 
ues of Y based on a model, with robustness sup- 
plied by a calibration term that weights the resid- 
uals from the complete cases (Robins, Rotnitzky 
and Zhao, 1994; Rotnitzky, Robins and Scharfstein, 
1998; Bang and Robins, 2005; Scharfstein, Rotnit- 
sky and Robins, 1999; Kang and Schafer, 2007). In 
this context, an estimator is doubly robust (DR) 
if it is consistent if either (a) the prediction model 
relating Y to X\ , . . . , X p is correctly specified or 
(b) the model for the propensity to respond is cor- 
rectly specified. In my last example, we describe 
a Bayesian missing data method, called Penalized 
Spline of Propensity Prediction (PSPP), that has 
a DR property. 

Define the logit of the propensity score for yi to 
be observed as 

(19) p*(ip) = logit (Pr(mj = 0\x a , . . .,x ip ;ij))), 

where ip denotes unknown parameters. The PSPP 
method is based on the balancing property of the 
propensity score, which means the missingness of yi 
depends only on (xn, . . . ,Xi p ) only through the pro- 
pensity score, under the MAR assumption (Rosen- 
baum and Rubin, 1983). Given this property, the 
mean of Y can be written as 

(20) fi y = E[(l - mi)yi] + E\rm x E( Vi \p*{$))\. 

Thus, the missing data can be imputed conditioning 
on the propensity score. This leads to the Penalized 
Spline of Propensity Prediction Method (PSPP) (Lit- 
tle and An, 2004; Zhang and Little, 2009). Imputa- 
tions in this method are predictions from the follow- 
ing model: 

(Vi \P* O) , xn , . . . , x ip ] V>, i3, 4>, a 2 ) 

(21) ~J\T(spl(pJty),0) 

where N(v, a 2 ) denotes the normal distribution with 
mean v and constant variance a 2 . The first compo- 
nent of the mean function, spl(p|(^),/3), is a spline 
function of the propensity score p*(^). The second 
component g(p*,Xi2, Xi p ; (f>) is a parametric func- 
tion, which includes any covariates other than p* 
that predict yi. One of the predictors, here xn, is 
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omitted from the g-function to avoid multicollinear- 
ity. 

A variety of spline functions can be chosen; we 
choose a penalized spline (Eilers and Marx, 1996; 
Ruppert, Wand and Carroll, 2003) of the form 

spl(K(^),/3) = /3o + /3iP*W 

(22) 

K 

where 1, p*(ip), (p?(^) - ki)+, . . . , (p?(^) - K fc )+ is 
the truncated linear basis; «!<•••< k# are selected 
fixed knots and IT is the total number of knots, and 
(fa, ■ ■ ■ i Pk+i) are random effects, assumed normal 
with mean and variance r 2 . This model can be fit- 
ted by ML using a number of existing software pack- 
ages, such as PROC MIXED in SAS (SAS, 1992; 
Ngo and Wand, 2004; Littell et al., 1996) and lme(-) 
in S-plus (Pinheiro and Bates, 2000). The first step 
of fitting a PSPP model estimates the propensity 
score, for example, by a logistic regression model or 
probit model of M on X\ , . . . , X p ; in the second step, 
the regression of Y on P* is fit as a spline model with 
the other covariates included in the model paramet- 
rically in the g-function. When Y is a continuous 
variable we choose a normal distribution with con- 
stant variance. For other types of data, extensions 
of the PSPP can be formulated by using the gener- 
alized linear models with different link functions. 

The average of the observed and imputed values 
of Y has a DR property, meaning that the predicted 
mean of Y is consistent if either (a) the mean of yi 
given (p*(ip),xn, . . . ,Xi p ) in model (3) is correctly 
specified, or (bl) the propensity p*(ip) is correctly 
specified, and (b2) = spl(p*(^),y3). 

The robustness feature derives from the fact that the 
regression function g does not have to be correctly 
specified, and the spline part of the regression func- 
tion involves a weak parametric assumption, prac- 
tically similar to the DR property mentioned above 
(Little and An, 2004; Zhang and Little, 2009). 

How does Bayes play into these methods? The 
missing values of can be multiply-imputed un- 
der this model, but note that these imputations in- 
volve a substantial number of unknown parameters, 
namely, the regression coefficients and variances (f3, 
t 2 ) of the spline model, the regression coefficients cj> 
in the parametric component g, the residual varian- 
ce <t 2 , and the nuisance parameters ip in the propen- 
sity function. Uncertainty in these parameters is rea- 
dily propagated under the Bayesian paradigm by 



drawing them from their posterior distributions, 
which is reasonably straightforward using a Gibbs' 
sampler. 

7. CONCLUSION 

I began this article by summarizing some argu- 
ments in favor of the CB approach to statistical in- 
ference, which to my mind incorporates the best fea- 
tures of both Bayesian and frequentist statistics. In 
short, inferences should be Bayesian, but model de- 
velopment and checking requires careful attention 
to frequentist properties of the resulting Bayesian 
inference. This CB "roadmap" is not a complete so- 
lution, since the interplay between model develop- 
ment and model inference, involving questions such 
as what range of model uncertainty should be in- 
cluded as part of formal statistical inference (Draper, 
1995), remains illusive and hard to pin down. How- 
ever, I find the CB approach a very satisfying basis 
for approaching practical statistical inference. 

In the remainder of the article I argued that the 
CB approach is particularly attractive for dealing 
with problems of missing data. In a sense, all of in- 
ferential statistics is a missing data problem, since it 
involves making inferences about something that is 
unknown and hence "missing" ; in that broad sense, 
I am merely restating the previous paragraph. How- 
ever, if missing data is considered more restrictively 
as referring to situations where the data matrix is in- 
complete, or partial information is available on some 
variables, then the Bayesian paradigm is conceptu- 
ally straightforward, since likelihoods do not require 
a fully-recorded rectangular data set. 

Simply put, Bayesian statistics involves generat- 
ing predictive distributions of unknowns given the 
data. Applied to missing data, it requires a predic- 
tive distribution for the missing data given the ob- 
served data. Multiple imputations are simply draws 
from this predictive distribution, and can be used 
for other analyses if a good model is chosen for the 
predictive distribution. 

In Sections 4 and 5 I outlined the basic theory un- 
derlying Bayes inference and MI with missing data, 
emphasizing the role of the MAR assumption. An 
important extension of this theory is to problems 
of coarsened data, where some values in the data 
set are rounded, grouped or censored (Heitjan and 
Rubin, 1991; Heitjan, 1994). This theory has con- 
nections with the concept of noninformative censor- 
ing that underlies many methods of survival analysis 
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with censored data. MI can be applied in these set- 
tings (Hsu et al., 2006), and is particularly useful 
for conditioning imputations on auxiliary variables 
not included in the primary analysis. 

In Section 6 I illustrated Bayesian approaches to 
missing data, mainly for normal models, in view of 
their practical importance and historical interest. I 
emphasize that Bayesian methods are also useful 
for non-normal missing data problems. The SRMI 
methods are not restricted to normal models, and 
Bayes and/or MI can be applied to categorical data 
models and mixtures of categorical and continuous 
variables (Little and Rubin, 2002, Chapters 13 and 
14), and generalized linear models with missing co- 
variates (Ibrahim et al., 2005). Bayesian hierarchical 
models are also natural for longitudinal data (Gilks 
et al., 1993; Ibrahim and Molenberghs, 2009) and 
small area estimation (Ghosh and Rao, 1994), with 
or without missing data. 

In the examples I focused on MAR models, but 
Bayesian approaches to NMAR models are also very 
appealing. A key problem when data are not missing 
at random is lack of identifiability of the model, and 
Bayesian methods provide a formal solution to the 
problem by allowing the formulation of prior dis- 
tributions for unidentified parameters that reflect 
the uncertainty (Rubin, 1977; Daniels and Hogan, 
2008). Resulting inferences are arguably superior to 
frequentist methods based on sensitivity analysis, 
where unidentified parameters are varied over a ran- 
ge. For a recent illustration of these ideas, see Long, 
Little and Lin (2010), who apply Bayesian missing 
data methods to handle noncompliance in clinical 
trials. 

I have focused here more on the Bayesian part of 
CB, given the emphasis of the workshop that moti- 
vated this article on Bayesian tools. Concerning the 
calibrated part, good frequentist properties require 
realistic models for the predictive distribution of the 
missing values, and this requires attention to check- 
ing the fit of the model to the observed data, and 
sensitivity analyses to assess the impact of depar- 
tures from MAR. Gelman et al. (2005) and Abay- 
omi, Gelman and Levy (2008) provide useful meth- 
ods for model checking for multiple imputation, and 
more work in this area would be useful. 
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